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Abstract 

We propose to store several integers modulo a small prime into a sin- 
gle machine word. Modular addition is performed by addition and possi- 
bly subtraction of a word containing several times the modulo. Modular 
Multiplication is not directly accessible but modular dot product can be 
performed by an integer multiplication by the reverse integer. Modular 
multiplication by a word containing a single residue is a also possible. 
Therefore matrix multiplication can be performed on such a compressed 
storage. We here give bounds on the sizes of primes and matrices for 
which such a compression is possible. We also explicit the details of the 
required compressed arithmetic routines. 



1 Introduction 

Compression of matrices over fields of characteristic 2 is naturally made via the 
binary representation of machine integers [TJ [5] • 

The FFLAS/FFPACK project has demonstrated the need of a wrapping of 
cache-aware routines for efficient small finite field linear algebra [H [5] . 
Therefore, a conversion between a modular representation of prime fields of 
any (small) characteristic and e.g. floating points can be performed via the 
homomorphism to the integers [2] . In [5] it is proposed to transform polynomial 
over a prime field into a Q-adic representation where Q is an integer than 
the field characteristic. We call this transformation DQT for Discrete Q-adic 
Transform. With some care, in particular on the size of Q, it is possible to map 
the polynomial operations into the floating point arithmetic realization of this 
Q-adic representation and convert back using an inverse DQT. 
Efficient matrix computations over very small finite fields of characteristic other 
than two are required e.g. to study strongly regular graphs [5], in order to 
prove/disprove and help in the comprehension of the conjectures of |10j . 
In this note we propose to use this fast polynomial arithmetic within machine 
words to compute dot products. We show in section [3] how to recover a dot 
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product of size d + 1 can be recovered from the single coefficient of degree d of a 
polynomial product. Whenever the prime modulus is small enough this enables 
to compute several accumulations of binary products in a single machine oper- 
ation. Then we propose in section [3] an alternative matrix multiplication using 
multiplication of a compressed word by a single residue. The latter requires 
also a simultaneous modular reduction, called REDQ in [3]. In general, the 
prime field, the size of matrices and the available mantissa are given. This gives 
some constraints on the possible choices of Q and d. In both cases anyway, we 
show that these compression techniques represent a speed-up factor of up to the 
number d + 1 of residues stored in the compressed format. 



2 Q-adic compression or Dot product via poly- 
nomial multiplication 

Suppose that a(X) = Y^,i=o a iX l and b(X) = Yli=o hX 1 are two polynomials 
in Z/pZ[X]. One can perform the dot product X)i=o a i^d-i by extracting the 
coefficient of degree d of a(X)b{X). 

2.1 Modular dot product via machine word multiplication 

The idea here, as in [3J, is to replace X by an integer Q, usually a power of 2 
in order to speed up conversions. Thus the vectors of residues a = [clq . . . ad] 
and b = [&o ■ ■ • M are stored respectively as b = Yli=o ^iQ l an d the reverse 
R = Ei=o a d-iQ l - 

This is done e.g. over floating points via the following compressions: 

doublefe init3( doublefe r, 

const double u, const double v, const double w) { 
r=u; r*=_dBase; r+=v; r*=_dBase; return r+=w; 
> 



2.2 Gain 



Now for matrix multiplication A x B one wishes to convert a whole row of the 
left mx k matrix A and a whole column of the right kxn matrix B. Thus A is 

CompressedRowMatrix, CA and B is transformed 



transformed into a m x 
into a 



k 
d+1 



x n CompressedRowMatrix, CB. 

Therefore the matrix multiply CA x CB can gain a factor of d + 1 over the 
multiplication of A x B, for classical multiplication as shown on the 2x2 ex- 



ample below where the matrix product 



ae + bg 
ce + dg 



af + bh 
cf + dh 



is 



performed via integer multiplications. The precise gain will be given in table El 
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* + (ae + bg)Q + *.Q 2 * + (af + bh)Q + *.Q 2 

* + (ce + dg)Q + *.Q 2 * + (cf + dh)Q + *.Q 2 

The result matrix C — CA x CB is m X n. Thus in order to compare similar 
computations one has either to consider multiplication of compressed matrices 
which is then the procedure 

C = CAx CB; CC = Reduce AndCompress(C) (1) 

or to consider multiplication of normal matrices via compression and thus the 
procedure 

CA = CompressRows(A); C B = CompressColumns(B); C = CA x CB (2) 

2.3 Partial compression 

Note that the last column of CA and the last row of B might not have d + 1 
elements if ^-j- ^ Z. Thus one has to artificially append some zeroes to the 
converted values. On b this means just do nothing. On the reversed a this 
means multiplying by Q several times. 

2.4 Delayed reduction and lower bound on Q 

For the results to be correct the inner dot product must not exceed Q. With a 
positive modular representation mod p (i.e. integers from to p~ 1), this means 
that (d + l)(p — l) 2 < Q. Moreover, we would like to use delayed reductions 
on the intermediate results and thus accumulate the db before any modular 
reduction. It is thus possible to perform matrix multiplications of with common 
dimension k as long as: 

JL_(d + l)(p-lf = k(p-l) 2 <Q. (3) 

2.5 Available mantissa and upper bound on Q 

If the product db is performed with floating point arithmetic we just need that 
the coefficient of degree d remains fully in the mantissa (3. Write db — CffQ d +CL, 
the latter means that ch, and ch only, must remain lower that 2^. It could 
then be exactly recovered by multiplication of db by the correctly precomputcd 
and rounded inverse of Q d and floored, as shown e.g. in [3l Lemma 2]. 
With delayed reduction this means that J^Uo dTi(* + l) 2 Q d_ ' < ^ ■ We 

can use equation [3] in order to show that X)j=o rfTl(* + l) 2 Q d ~ l < Q d+1 - 

With this we just have to enforce that 

Q d+1 < 2' 3 . (4) 

Thus a single reduction has to be made at the end of the dot product as follows: 



Qa + b 
Qc + d 



[e + Qg f + 1 
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Elementfe init( Elementfe rem, const double dp) const -[ 
double r = dp; 

// Multiply by the inverse of Q~d with correct rounding 
r *= _inverseQto_d; 

// Now we just need the part less than Q=2~t 
unsigned long rl( static_cast<unsigned long>(r) ); 
rl &= _QMINUSONE; 

// And we finally perform a single modular reduction 
rl %= _modulus ; 

return rem = static_cast<Element> (rl) ; 

> 

Note that one can avoid the multiplication by the inverse of Q when Q = 2*: 
by adding Q 2d+1 to the final result one is guaranteed that the t(d + 1) high bits 
represent exactly the d+ 1 high coefficients. On the one hand, the floating point 
multiplication can be replaced by an addition. On the other hand, this doubles 
the size of the dot product and thus reduces by a factor of d+ \/2 the largest 
possible dot product size k. 

2.6 Results 

One can see on figure [1] that the compression [d + 1) is very useful for small 
primes since the gain over the double floating point routine is quite close to d. 
Indeed choosing a power of 2 for Q simplifies and speeds up conversions and 
thus gives the following compression factors modulo 3: 
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< 256 
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Table 1: Compression factors for different common matrix dimensions modulo 3, 
with 53 bits of mantissa and Q a power of 2. 

Before n = 256 the compression is at a factor of five and the time to perform a 
matrix multiplication is less than a hundredth of a second. Then from 257 to 
2048 one has a factor of 4 and the times are roughly 16 times the time of the 
four times smaller matrix, as visually shown on figure left. The same is true 
afterwards with the respective factor 3 of compression. 

Remark that the curve of fgemm with underlying arithmetic on single floats 
oscillates and drops. This is because the matrix begins to be too large and that 
modular reductions are required between the recursive matrix multiplication 
steps. Then the floating point routines are used only when the sub- 

matrices are small enough. One can see the subsequent increase in the number 
of classical arithmetic steps on the drops around 2048, 4096 and 8192. 

1 http: // www. tacc. utexas . edu / resources / software / 
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Finite field Winograd matrix multiplication with Goto BLAS on a XEON, 3.6 GHz 
25 I 1 1 1 1 




2000 4000 6000 8000 10000 

Matrix order 



Figure 1: Compressed matrices multiplication of equation [T] compared with 
dgemm (the floating point double precision routine of GotoBlas) and fgemm 
(the exact routine of FFLAS) with double or single precision. 



3 Right Compressed matrix multiplication 



Another way of performing compressed matrix multiplication is to multiply an 
uncompressed matrix m x k to the right by a row-compressed k x ^-j- ma- 
trix. A dot product with this algorithm will be of the form a — [clq, . . . , a n ] x 
Ej=o bojQ-i , ■ ■ ■ , lCj=o bnjQ-'}- Therefore, a single entry of the resulting matrix 



will be J2i 
below. 



=0 bij 



Q J 



j=oEi=o a ibij )Q J as shown on the example 



a b 


X 


e + Qf 




c d 


g + Qh 





(ae + bg) + Q{af + bh) 
(ce + dg) + Q{cf + dh) 



Here also Q and d must satisfy equations © and ([!]). 

The major difference is in the reductions. Indeed now one needs to reduce 
simultaneously the d + 1 coefficients of the polynomial in Q in order to get the 
results. This simultaneous reduction can be made by the REDQ algorithm of 
[3 Algorithm 2]. 

Thus the whole right compressed matrix multiplication over two compressed 
matrices CA and CB, is the following algorithm as shown also on figure [H 
right: 

A = Uncompress{CA);CC = AxCB; REDQ(CC) (5) 
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4 Full compression 

Of course one would like to compress simultaneously two dimensions of the 
matrix product. The required dot products can there be achieved by polynomial 
multiplication with two variables Q and 0. Let d q be the degree in Q and dg 

be the degree in O. The dot product is then: a = Ef=o a *0j ■ • ■ i X^=o a ™] x 

Ejo & <y 0J '> ■ • • » Ejo 6niQ , 'l- The latter is Eto(E-= ««)(E ?4 W*e j = 

Ei=o Ejf=o(Ef=o aubij)Q % Q : ' as shown on the example below. 



[a + Qc 6 + Qd] X 



e + Qf 
g + Qh 



[(ae + bg) + Q{ce + dg) + 0(af + bh) + QQ(cf + dh)} 



In order to guarantee that all the coefficients can be recovered independently, 
Q must still satisfy equation §3§ but then we have this additional equation on 
6: 

Q d * +1 < e (6) 



This gives thus upper bounds on d q and dg: 

Q(d q +l)(de+l) < 2 /3 



(7) 



5 Comparison 

We summarize the differences of the presented algorithms on figure [5] and [31 
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Figure 2: Algorithms of equations ([T]), left, and ([5]), right. 



We see on the one hand that the first algorithm compresses the common di- 
mension whereas the Right (or also Left) compressions compress an external 
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Figure 3: Left Compression and Full Compression 



matrix dimension. Thus in the case of rectangular matrices one can choose be- 
tween those routines the fastest one. This will be the routine compressing the 
largest dimension as shown on table [2j On the other hand the full compression 
algorithm compresses both external sizes but, as shown by equation {7} the 
available mantissa is is only shared by both compression. Therefore if we define 
the compression factor to be 



6 |>g 2 (Q)_ 

then the degree of compression for the first three algorithms is just be d = e — 1 
where it becomes d = \fe — 1 for the full compression with equal degrees for 
both variables Q and 0. For ui the exponent of matrix multiplication, the table 
[2]shows that the gain in terms of arithmetic operations is e"~ 2 for the first three 
variants and e~ for the full compression. When w = 3 for classical matrix 
multiplication the speed-up is the same. Now, when fast matrix multiplication 
is used, as e.g. in [6j §3.2], full compression performs less operations. This is 
not only of theoretical interest but also of practical value since the considered 
matrices are then less rectangular. This enables more locality for the matrix 
computations and usually better performance. 
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Table 2: Number of operations for the different algorithms 



The difference there will mainly be on the number and on the kind of modular 
reductions. Since the REDQ e reduction is faster than e classical reductions, 
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see [5] , and since INIT e and EXTRACT,, are roughly the same operations, the 
best algorithm would then be one of the Left, Right or Full compression. 
For example, with algorithm ([T]) on matrices of sizes 10000 x 10000 it took 92.75 
seconds to perform the matrix multiplication modulo 3 and 0.25 seconds to 
convert the resulting C matrix. This is less than 0.3%. For 250 x 250 matrices 
it takes less than 0.0028 seconds to perform the multiplication and roughly 
0.00008 seconds for the conversions. There, the conversions count for 3%. 
The full compression algorithm seems the better candidate for the locality and 
fast matrix multiplication reasons above ; howbeit the compression factor is an 

integer, depending on the flooring of either log ^Q) or \J \ of f^Qy Thus there are 
matrix dimensions for which the compression factor of e.g. the right compression 
will be larger than the square of the compression factor of the full compression. 
There the right compression will have some advantage over the full compression. 
Further work would thus include implementing the Right or Full compression 
and comparing the conversions overhead with that of algorithm (fljl. 
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